Introduction:
Welcome to my exciting journey through the world of finance education on Udemy! In this exploration, we'll delve into a rich dataset containing information about various finance courses offered on the popular online learning platform. By leveraging the powerful tools of Python such as pandas, Matplotlib, and Plotly, we aim to uncover valuable insights and trends within this dataset.
Data Exploration:
Data Cleaning:
- Load the Udemy finance course dataset into a pandas DataFrame.
- Identify and handle any missing values or inconsistencies in the data.
- Remove outliers to ensure the integrity of our analysis.
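The outlier-removal step mentioned above can be sketched with the usual IQR rule. This is a minimal sketch on a toy frame, not the notebook's actual cleaning code; the column name is illustrative:

```python
import pandas as pd

def remove_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]

# Toy example: 1000 is an obvious outlier among the small values
toy = pd.DataFrame({"subscribers": [10, 12, 11, 13, 9, 1000]})
cleaned = remove_iqr_outliers(toy, "subscribers")
```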
Exploratory Data Analysis:
- Investigate the distribution of course offerings across different categories and topics within finance.
- Analyze the popularity of courses based on factors such as average ratings, number of reviews, and current pricing.
- Explore the relationship between course characteristics and subscriber engagement.
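The exploratory steps above reduce to a few pandas idioms. A minimal sketch on a toy frame (the column names mirror the dataset's; the data is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "avg_rating": [4.5, 3.2, 4.8, 4.0],
    "num_reviews": [120, 15, 300, 60],
    "subscribers": [1000, 50, 5000, 800],
})

# Popularity: top courses by subscriber count
top = toy.nlargest(2, "subscribers")[["title", "subscribers"]]

# Engagement vs. characteristics: pairwise correlation of the numeric fields
corr = toy[["avg_rating", "num_reviews", "subscribers"]].corr()
```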
!pip install langdetect
!pip install language_data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import langdetect
import langcodes
df_orignal = pd.read_csv("udemy.csv")
df_orignal.sample(5)
| | id | title | url | is_paid | num_subscribers | avg_rating | avg_rating_recent | rating | num_reviews | is_wishlisted | num_published_lectures | num_published_practice_tests | created | published_time | discount_price__amount | discount_price__currency | discount_price__price_string | price_detail__amount | price_detail__currency | price_detail__price_string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7159 | 1817268 | Start Your Own Business with the 9-5 Dropout A... | /course/start-your-own-business-with-the-9-5-d... | True | 1671 | 4.45 | 4.38060 | 4.38060 | 11 | False | 33 | 0 | 2018-07-24T13:32:58Z | 2018-08-29T16:00:44Z | 455.0 | INR | ₹455 | 6400.0 | INR | ₹6,400 |
| 3117 | 1697090 | Microsoft Project 2016 by Abu Bakar - 20 PDU's | /course/msproject2016/ | True | 363 | 4.50 | 4.48216 | 4.48216 | 70 | False | 25 | 0 | 2018-05-15T20:11:39Z | 2018-11-16T20:04:15Z | NaN | NaN | NaN | 12800.0 | INR | ₹12,800 |
| 12874 | 3330606 | Besserer Umgang mit Geld | /course/besserer-umgang-mit-geld/ | True | 142 | 4.25 | 4.30303 | 4.30303 | 2 | False | 24 | 0 | 2020-07-14T09:26:20Z | 2020-07-15T07:24:43Z | 455.0 | INR | ₹455 | 1280.0 | INR | ₹1,280 |
| 8988 | 2382421 | How to Massively Increase Course Sales - Chang... | /course/online-course-sales-one-thing/ | True | 3861 | 4.75 | 4.69269 | 4.69269 | 4 | False | 14 | 0 | 2019-05-24T12:51:53Z | 2019-05-31T17:33:33Z | 455.0 | INR | ₹455 | 8640.0 | INR | ₹8,640 |
| 2638 | 441848 | Business Writing That Gets Results | /course/business-writing-that-gets-results/ | True | 2151 | 3.75 | 3.76465 | 3.76465 | 91 | False | 27 | 0 | 2015-03-08T10:56:54Z | 2015-04-05T16:25:30Z | 455.0 | INR | ₹455 | 1280.0 | INR | ₹1,280 |
df = df_orignal.copy() # work on a copy so the raw dataset stays untouched
print("Price Detail Currencies:", df['price_detail__currency'].unique())
print("\nValue Counts of Price Detail Currencies:", df['price_detail__currency'].value_counts())
print("\nDiscount Price Currencies:", df['discount_price__currency'].unique())
print("\nValue Counts of Discount Price Currencies:", df['discount_price__currency'].value_counts())
print("\nValue Counts of wishlisted courses: ", df["is_wishlisted"].value_counts())
print("\nValue Counts of practice tests: ", df.num_published_practice_tests.value_counts())
Price Detail Currencies: ['INR' nan]

Value Counts of Price Detail Currencies: price_detail__currency
INR    13111
Name: count, dtype: int64

Discount Price Currencies: ['INR' nan]

Value Counts of Discount Price Currencies: discount_price__currency
INR    12205
Name: count, dtype: int64

Value Counts of wishlisted courses:  is_wishlisted
False    13608
Name: count, dtype: int64

Value Counts of practice tests:  num_published_practice_tests
0    13008
2      214
1      202
6       67
5       43
3       39
4       35
Name: count, dtype: int64
df[["discount_price__amount", "discount_price__price_string", "price_detail__amount", "price_detail__price_string"]].sample(5)
| | discount_price__amount | discount_price__price_string | price_detail__amount | price_detail__price_string |
|---|---|---|---|---|
| 12735 | NaN | NaN | 11520.0 | ₹11,520 |
| 2437 | 455.0 | ₹455 | 5760.0 | ₹5,760 |
| 4525 | 455.0 | ₹455 | 8640.0 | ₹8,640 |
| 7476 | 455.0 | ₹455 | 4800.0 | ₹4,800 |
| 10445 | 455.0 | ₹455 | 3200.0 | ₹3,200 |
Both currency columns contain only INR or NaN, so they add no information; we will remove them. Columns like "id" and "url" likewise contribute little to our understanding of the data and will be dropped.
The price-string fields are merely formatted renderings of their corresponding discount_price and price_detail amounts, and they share the same NaN pattern, so it is safe to remove them entirely.
No course in this dataset is wishlisted, so we will drop that field too. Also, free courses carry a non-zero current price when it should ideally be zero; we will correct this.
Most finance courses on Udemy don't offer practice tests for their subscribers.
Moreover, the "created" and "published_time" fields contain datetime format data rather than the object data type. We'll convert these fields to the appropriate datetime format.
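Before dropping the string fields, the claim that they simply mirror the amount columns can be sanity-checked. A sketch on a toy stand-in for the discount-price pair of columns:

```python
import pandas as pd

# Toy stand-in for the discount-price pair of columns
toy = pd.DataFrame({
    "discount_price__amount": [455.0, None, 455.0],
    "discount_price__price_string": ["₹455", None, "₹455"],
})

# 1) The NaN masks of the two columns should be identical
same_nans = toy["discount_price__amount"].isna().equals(
    toy["discount_price__price_string"].isna()
)

# 2) Parsing the string (strip the rupee sign and commas) should reproduce the amount
parsed = (toy["discount_price__price_string"]
          .str.replace("₹", "", regex=False)
          .str.replace(",", "", regex=False)
          .astype(float))
matches = parsed.dropna().equals(toy["discount_price__amount"].dropna())
```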
df = df.drop(columns = ["id", "url", "price_detail__currency", "discount_price__currency", "discount_price__price_string","price_detail__price_string", "is_wishlisted"])
df.created = pd.to_datetime(df.created)
df.published_time = pd.to_datetime(df.published_time)
df.rename(columns={'discount_price__amount': 'current_price', 'price_detail__amount': "orignal_price", 'num_subscribers':"subscribers"}, inplace=True)
df.loc[df['is_paid'] == False, 'current_price'] = 0
df.dtypes
title                                        object
is_paid                                        bool
subscribers                                   int64
avg_rating                                  float64
avg_rating_recent                           float64
rating                                      float64
num_reviews                                   int64
num_published_lectures                        int64
num_published_practice_tests                  int64
created                         datetime64[ns, UTC]
published_time                  datetime64[ns, UTC]
current_price                               float64
orignal_price                               float64
dtype: object
df.isnull().sum()
title                             0
is_paid                           0
subscribers                       0
avg_rating                        0
avg_rating_recent                 0
rating                            0
num_reviews                       0
num_published_lectures            0
num_published_practice_tests      0
created                           0
published_time                    0
current_price                   907
orignal_price                   497
dtype: int64
df[["current_price", "orignal_price"]] = df[["current_price", "orignal_price"]].interpolate() # interpolate only the numeric price columns
df.isna().sum()
title                           0
is_paid                         0
subscribers                     0
avg_rating                      0
avg_rating_recent               0
rating                          0
num_reviews                     0
num_published_lectures          0
num_published_practice_tests    0
created                         0
published_time                  0
current_price                   0
orignal_price                   0
dtype: int64
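Note that `interpolate()` fills each gap from its row-order neighbours, yet row order carries no meaning in this dataset, so the imputed prices are essentially arbitrary. A distribution-based fill such as the column median is a common alternative; a sketch on a toy series:

```python
import pandas as pd

prices = pd.Series([455.0, None, 455.0, 3200.0, None])

# Order-based fill: each NaN is filled from its row-order neighbours
by_order = prices.interpolate()

# Distribution-based fill: every NaN gets the column median
by_median = prices.fillna(prices.median())
```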
df.describe()
| | subscribers | avg_rating | avg_rating_recent | rating | num_reviews | num_published_lectures | num_published_practice_tests | current_price | orignal_price |
|---|---|---|---|---|---|---|---|---|---|
| count | 13608.000000 | 13608.000000 | 13608.000000 | 13608.000000 | 13608.000000 | 13608.000000 | 13608.000000 | 13608.000000 | 13608.000000 |
| mean | 2847.010435 | 3.923293 | 3.912242 | 3.912242 | 243.169827 | 32.224794 | 0.110523 | 476.277741 | 4719.647266 |
| std | 9437.865634 | 1.031304 | 1.039237 | 1.039237 | 1580.965895 | 42.766911 | 0.623501 | 276.566197 | 3206.448904 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1280.000000 |
| 25% | 62.000000 | 3.800000 | 3.787315 | 3.787315 | 7.000000 | 12.000000 | 0.000000 | 455.000000 | 1600.000000 |
| 50% | 533.000000 | 4.194440 | 4.181735 | 4.181735 | 24.000000 | 21.000000 | 0.000000 | 455.000000 | 3200.000000 |
| 75% | 2279.500000 | 4.450000 | 4.452105 | 4.452105 | 87.000000 | 37.000000 | 0.000000 | 455.000000 | 8640.000000 |
| max | 374836.000000 | 5.000000 | 5.000000 | 5.000000 | 78006.000000 | 699.000000 | 6.000000 | 3200.000000 | 12800.000000 |
Content Depth Analysis: exploring the relationship between the number of published lectures and course ratings, and determining whether courses with more lectures tend to have better ratings and higher subscriber engagement.
df[["avg_rating", "avg_rating_recent", "subscribers", "num_reviews", "num_published_lectures"]].corr()
| | avg_rating | avg_rating_recent | subscribers | num_reviews | num_published_lectures |
|---|---|---|---|---|---|
| avg_rating | 1.000000 | 0.989222 | 0.082100 | 0.068631 | 0.116836 |
| avg_rating_recent | 0.989222 | 1.000000 | 0.084198 | 0.069347 | 0.113875 |
| subscribers | 0.082100 | 0.084198 | 1.000000 | 0.784190 | 0.211672 |
| num_reviews | 0.068631 | 0.069347 | 0.784190 | 1.000000 | 0.179063 |
| num_published_lectures | 0.116836 | 0.113875 | 0.211672 | 0.179063 | 1.000000 |
The rating metrics are almost perfectly correlated with one another (0.99) but only weakly related to the number of reviews. In contrast, the number of subscribers is strongly correlated with the number of reviews (0.78).
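Pearson correlation (the `.corr()` default) can be dragged around by heavy-tailed counts like `num_reviews`; a rank-based Spearman correlation is a quick robustness check. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "subscribers": [10, 50, 200, 5000, 80000],
    "num_reviews": [1, 4, 20, 600, 9000],
})

# Linear association (sensitive to the heavy tail)
pearson = toy.corr(method="pearson").loc["subscribers", "num_reviews"]

# Rank association (robust to the heavy tail)
spearman = toy.corr(method="spearman").loc["subscribers", "num_reviews"]
```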
Now let's create some new fields.
# Defining the bins and labels for the categories
bins = [0, 2.5, 3.9, 4.3, 4.8, 5]
labels = ['Poor', 'Below Average', 'Above Average', 'Good', 'Excellent']
# Creating a new column to categorize courses based on avg-rating
df['rating_category'] = pd.cut(df['avg_rating'], bins=bins, labels=labels, right=False)
df[["avg_rating", "rating_category"]].sample(10)
| | avg_rating | rating_category |
|---|---|---|
| 9013 | 5.00000 | NaN |
| 6195 | 4.30000 | Good |
| 2043 | 3.93182 | Above Average |
| 10180 | 3.75000 | Below Average |
| 11008 | 5.00000 | NaN |
| 13340 | 0.00000 | Poor |
| 4984 | 4.30000 | Good |
| 12133 | 3.80000 | Below Average |
| 11061 | 3.00000 | Below Average |
| 2106 | 4.34328 | Good |
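The NaNs in the sample above come from `pd.cut`'s interval convention: with `right=False` the bins are half-open `[a, b)`, so a perfect 5.0 falls outside the last bin `[4.8, 5)`. A minimal demonstration, including one way to fix it by nudging the top edge past 5:

```python
import pandas as pd

ratings = pd.Series([0.0, 4.79, 4.8, 5.0])
bins = [0, 2.5, 3.9, 4.3, 4.8, 5]
labels = ["Poor", "Below Average", "Above Average", "Good", "Excellent"]

# [4.8, 5) excludes 5.0, so a perfect rating maps to NaN
cut = pd.cut(ratings, bins=bins, labels=labels, right=False)

# Extending the top edge past 5 keeps every rating in range
cut_fixed = pd.cut(ratings, bins=[0, 2.5, 3.9, 4.3, 4.8, 5.0001],
                   labels=labels, right=False)
```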
df['total_sales'] = df['subscribers'] * df['current_price']
paid_courses = df[df['is_paid'] == True]
free_courses = df[df["is_paid"] == False]
We can see there are many duplicated courses, so let's filter them out and keep our titles distinct.
duplicates = df.duplicated(subset=['title'])
print("Number of Duplicate titles:", df[duplicates]['title'].shape)
Number of Duplicate titles: (45,)
df = df[~duplicates]
#checking again to see if it worked
duplicates = df.duplicated(subset=['title'])
print("Number of Duplicate titles:", df[duplicates]['title'].shape)
Number of Duplicate titles: (0,)
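The two-step mask above can also be written as a single `drop_duplicates` call, which keeps the first occurrence of each title. A sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Intro to Finance", "Excel Basics", "Intro to Finance"],
    "subscribers": [100, 50, 80],
})

# keep="first" retains the earliest row for each duplicated title
deduped = toy.drop_duplicates(subset=["title"], keep="first")
```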
Now our data is consistent. Next, let's visualise how many finance courses are paid.
Visualization:
Data Visualization:
- Utilize Matplotlib and Plotly to create engaging visualizations that highlight key insights from our data exploration.
- Generate bar charts, scatter plots, and box plots to illustrate trends and patterns in course popularity, ratings, and pricing.
- Ensure that the visualizations are clear, informative, and visually appealing.
Interactive Visualization:
- Harness the interactive capabilities of Plotly to create dynamic dashboards and plots.
- Allow users to interactively explore the dataset, filter courses based on their preferences, and gain deeper insights into finance education trends on Udemy.
fig = go.Figure(data=go.Violin(y=paid_courses['current_price'],x = paid_courses["rating_category"],
box_visible=True,
line_color='blue',
meanline_visible=True,
fillcolor='lightblue',
opacity=0.6,
hoverinfo='y+name'))
fig.update_layout(
title='Distribution of Current Prices Across Rating Categories',
xaxis_title='Rating Category',
yaxis_title='Current Price',
yaxis=dict(range=[400, 700]),
height=400,
width=1000,
)
fig.show()
plt.figure(figsize = (4,3))
# Countplot for paid vs. free courses
sns.countplot(data=df, x='is_paid', width=0.5)
plt.title('Paid vs Free Courses')
plt.xlabel('Course Type')
plt.xticks([0, 1], labels = ['Free', 'Paid'])
plt.show()
# Course Creation or Publication Over Time
# Aggregate the data by year and count the number of courses created or published each year
courses_over_time = df.groupby(df['created'].dt.year)['title'].count()
# Create a line plot to visualize the trend of course creation or publication over time
fig_time_series = go.Figure(data=go.Scatter(x=courses_over_time.index, y=courses_over_time.values,
mode='lines+markers', marker=dict(color='blue'),
name='Courses Created or Published'))
fig_time_series.update_layout(title='Finance Courses Created or Published Over Time',
xaxis_title='Year', yaxis_title='Number of Courses',
height=500, width=900)
fig_time_series.show()
fig = go.Figure(go.Violin(
y=df['num_published_lectures'],
box_visible=True,
line_color='blue',
meanline_visible=True,
fillcolor='lightblue',
opacity=0.6
))
# Update layout
fig.update_layout(
title='Distribution of Number of Published Lectures',
yaxis_title='Number of Published Lectures',
yaxis=dict(range=[0, 100]),
height=500,
width=800
)
# Show plot
fig.show()
# Distribution of Time to Publication
# Calculate the time taken from course creation to publication (in days)
df['time_to_publication'] = (df['published_time'] - df['created']).dt.days
# Create a violin plot to visualize the distribution of time to publication
fig_time_to_publication_violin = go.Figure(data=go.Violin(y=df['time_to_publication'],
box_visible=True,
line_color='blue',
meanline_visible=True,
fillcolor='lightblue',
opacity=0.6,
hoverinfo='y+name'))
fig_time_to_publication_violin.update_layout(title='Distribution of Time to Publication for Finance Courses',
yaxis_title='Time to Publication (Days)',
yaxis=dict(range=[-1, 200]),
height=500, width=900)
fig_time_to_publication_violin.show()
top_paid_courses = paid_courses.nlargest(5, 'subscribers')
top_free_courses = free_courses.nlargest(5, 'subscribers')
fig_paid = go.Figure(go.Bar(
x=top_paid_courses['subscribers'],
y=top_paid_courses['title'],
orientation='h',
marker=dict(color='crimson')
))
fig_paid.update_layout(
title='Top 5 Most Subscribed Paid Courses',
xaxis_title='Subscribers',
yaxis_title=None,
height=400,
width=1100
)
fig_free = go.Figure(go.Bar(
x=top_free_courses['subscribers'],
y=top_free_courses['title'],
orientation='h',
marker=dict(color='olive')
))
fig_free.update_layout(
title='Top 5 Most Subscribed Free Courses',
xaxis_title='Subscribers',
yaxis_title=None,
height=400,
width=1100
)
# Show plots
fig_paid.show()
fig_free.show()
# Sort the DataFrame based on total sales in descending order and select top 7 courses
top_7_courses = paid_courses.nlargest(7, 'total_sales')
# Create a horizontal bar chart to visualize the top 7 courses with the highest sales
fig = go.Figure(go.Bar(
x=top_7_courses['total_sales'],
y=top_7_courses['title'],
orientation='h',
marker=dict(color='crimson')
))
# Update layout
fig.update_layout(
title='Top 7 Courses with the Highest Sales on Udemy',
xaxis_title='Total Sales (INR)',
yaxis_title=None,
height=400,
width=1200,
bargap=0.2
)
# Show plot
fig.show()
fig = px.scatter(df, x='avg_rating', y='num_reviews', color='rating_category', opacity=0.7,
title='Average Rating vs. Number of Reviews',
labels={'avg_rating': 'Average Rating', 'num_reviews': 'Number of Reviews'},
hover_name='title', hover_data={'avg_rating': True, 'num_reviews': True})
# Customize the layout
fig.update_layout(title_x=0.5, height=400, width=1000,
xaxis=dict(showgrid=True),
yaxis=dict(showgrid=True))
fig.show()
Let's visualise courses based on this new field, "rating_category".
# Count the number of courses in each rating category
rating_counts = df['rating_category'].value_counts()
# Create a custom color scale
colors = px.colors.qualitative.Set1
# Create a horizontal bar chart
fig = go.Figure(go.Bar(
x=rating_counts.values,
y=rating_counts.index,
orientation='h',
marker=dict(color=colors[:len(rating_counts)]),
))
# Update layout
fig.update_layout(
title='Distribution of Courses by Rating Category',
xaxis_title='Count',
yaxis_title='Course rating category',
height=500,
width=800
)
# Show plot
fig.show()
def create_scatter_plot(df, rating_categories, title):
fig = go.Figure()
for category in rating_categories:
df_category = df[df['rating_category'] == category]
correlation = df_category['num_reviews'].corr(df_category['avg_rating'])
fig.add_trace(go.Scatter(x=df_category['num_reviews'], y=df_category['avg_rating'],
mode='markers', marker=dict(size=8, opacity=0.7),
name=f'{category} courses (Corr: {correlation:.2f})',
text=[f'Corr: {correlation:.2f}' for _ in range(len(df_category))]))
fig.update_layout(title=title, xaxis_title='Number of Reviews', yaxis_title='Average Rating',
height=400, width=900)
fig.show()
create_scatter_plot(free_courses, free_courses['rating_category'].unique(),
'Impact of Reviews on Average Rating for Free courses')
create_scatter_plot(paid_courses, paid_courses['rating_category'].unique(),
'Impact of Reviews on Average Rating for Paid courses')
Across the different rating categories, the number of reviews shows no meaningful relationship with the average rating.
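The per-category correlations shown in the scatter legends can also be tabulated directly with a groupby. A sketch on toy data (with the real frame, one would group `df` by `rating_category`):

```python
import pandas as pd

toy = pd.DataFrame({
    "rating_category": ["Good", "Good", "Good", "Poor", "Poor", "Poor"],
    "num_reviews": [10, 200, 3000, 5, 40, 900],
    "avg_rating": [4.4, 4.5, 4.6, 1.2, 2.0, 0.8],
})

# One reviews-vs-rating correlation per rating category
per_cat = {
    cat: g["num_reviews"].corr(g["avg_rating"])
    for cat, g in toy.groupby("rating_category")
}
```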
# Function to detect the language of a title
def detect_language(title):
    try:
        return langdetect.detect(title)
    except Exception:  # langdetect raises LangDetectException on undetectable text
        return 'Unknown'
# Function to map ISO language codes to full names
def map_language(short_form):
    try:
        return langcodes.Language.make(short_form).language_name().title()
    except Exception:  # unknown or unparsable code
        return 'Unknown'
# Apply language detection to title and mapping to language field
df['language'] = df['title'].apply(detect_language).apply(map_language)
language_counts = df['language'].value_counts().reset_index()
language_counts.columns = ['Language', 'Count']
language_counts = language_counts.sort_values(by='Count', ascending=False)
top_languages = language_counts.head(3)
other_languages_count = language_counts['Count'].sum() - top_languages['Count'].sum()
labels_pie = top_languages['Language'].tolist() + ['Others']
values_pie = top_languages['Count'].tolist() + [other_languages_count]
fig_pie = go.Figure(data=[go.Pie(labels=labels_pie, values=values_pie, hole=0.5)])
fig_pie.update_layout(
title='Top 3 Languages in Finance Courses',
width=600,
height=400,
showlegend=True
)
fig_pie.show()
language_counts_treemap = language_counts.iloc[3:]
fig_treemap = px.treemap(language_counts_treemap, path=['Language'], values='Count', title="Let's zoom in on 'Others'")
# Show treemap
fig_treemap.show()